
---
title: "Taboo Subjects Online Supplement - Descriptive Statistics and Parametric Assumptions"
author: "Kaylea Champion and Benjamin Mako Hill"
affiliation: "University of Washington"
date: "8/5/2023"
output: pdf_document
---
  
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library('ggplot2')
library('scales')
library('xtable')
library('knitr')
setwd("~/Research/taboo/StepZRAnalysis")
load('../processed_data/artDF.RData')
load('../knitr_rdata/knitr_data.RData')
load('~/Research/taboo/processed_data/EDA.RData') 
#need vDF, qDF, revDF.clean

reverselog_trans <- function(base = exp(1)) {
  trans <- function(x) -log(x, base)
  inv <- function(x) base^(-x)
  trans_new(paste0("reverselog-", format(base)), trans, inv, 
            log_breaks(base = base), 
            domain = c(1e-100, Inf))
}

options(scipen=999)
```

## Overview

The purpose of this study is to understand the production and development of articles on taboo subjects.

### Independent Variables
- taboo (article)

### Dependent Variables
- article readership (H1)
- contribution quantity (H2)
- contribution quality (H3)
- article quality (H4)
- contributor identifiability (H5)

### Important Controls

- article protection (makes contribution impossible for newcomers and for contributors without accounts)

### Relationships to be Tested

- H1 - articles on taboo subjects have higher readership than comparable articles
- H2 - articles on taboo subjects receive a lower quantity of contributions than comparable articles
- H3 - articles on taboo subjects receive lower quality contributions than comparable articles
- H4 - articles on taboo subjects are of lower quality than comparable articles
- H5 - articles on taboo subjects are more likely than comparable articles to receive contributions from less identifiable contributors

### Units of Analysis
- The article (H1, H4)
- The revision of the article (H2, H3)
- The user and the revision (H5)

## H1 Descriptives: Viewership

View rank is not normally distributed, even when logged. We therefore use the Mann-Whitney U test to evaluate whether the observed differences are significant.
```{r H1descriptives, fig.keep='all'}
hist(vDF$viewRankInMyMonth)
hist(log(vDF$viewRankInMyMonth))
ks.test(log(vDF$viewRankInMyMonth), "pnorm")
viewerTest <- wilcox.test(viewRankInMyMonth ~ source, data=vDF, exact=FALSE)
viewerTest
```
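
One caveat on the normality check above: `ks.test(x, "pnorm")` with no further arguments compares `x` against the standard normal (mean 0, standard deviation 1), so it can reject data that are normal but simply located or scaled differently. The sketch below uses simulated values rather than `vDF` (`sim_rank` is a hypothetical stand-in) to illustrate the difference that supplying estimated parameters can make. Because estimating the parameters from the same data makes the KS p-value only approximate, `shapiro.test()` is included as a cross-check.

```r
set.seed(42)
# Simulated stand-in for a logged rank variable: normal, but not N(0, 1)
sim_rank <- rnorm(500, mean = 8, sd = 2)

# Default parameters: compares against N(0, 1), so location and scale alone cause rejection
ks_default <- ks.test(sim_rank, "pnorm")

# Supplying the sample mean and sd compares the shape of the distribution instead
ks_fitted <- ks.test(sim_rank, "pnorm", mean = mean(sim_rank), sd = sd(sim_rank))

# Shapiro-Wilk as a cross-check (accepts between 3 and 5000 observations)
sw <- shapiro.test(sim_rank)

c(default = ks_default$p.value, fitted = ks_fitted$p.value, shapiro = sw$p.value)
```

If the conclusion of non-normality holds under both variants on the real data, the choice of a rank-based test is unaffected.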

## H2 Descriptives: Contribution Quantity
Contribution quantity is not normally distributed, even when logged. We therefore use the Mann-Whitney U test to evaluate whether the observed differences are significant.
```{r H2descriptives}
hist(artDF$revid) #number of revisions
hist(log(artDF$revid))
ks.test(log(artDF$revid), "pnorm")
quantTest <- wilcox.test(revid ~ source, data=artDF, exact=FALSE)
quantTest


```
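
The Mann-Whitney U test indicates whether the two samples differ but not by how much. A minimal sketch, assuming simulated counts and a hypothetical two-level `source` factor rather than `artDF`, of converting the `W` statistic into a probability-of-superiority effect size:

```r
set.seed(1)
# Simulated stand-in data; the real analysis uses artDF$revid and artDF$source
toy <- data.frame(
  source = factor(rep(c("comparison", "taboo"), times = c(60, 40))),
  revid  = c(rnbinom(60, size = 1, mu = 50), rnbinom(40, size = 1, mu = 80))
)

mw <- wilcox.test(revid ~ source, data = toy, exact = FALSE)

# W counts the pairs in which the first-level group ("comparison") has the
# larger value (ties count one half), so W / (n1 * n2) estimates
# P(comparison > taboo) + 0.5 * P(tie).
n1 <- sum(toy$source == "comparison")
n2 <- sum(toy$source == "taboo")
prob_superiority <- unname(mw$statistic) / (n1 * n2)

# Rank-biserial correlation, on a -1 to 1 scale
rank_biserial <- 2 * prob_superiority - 1
c(prob_superiority = prob_superiority, rank_biserial = rank_biserial)
```

A value of `prob_superiority` near 0.5 indicates little separation between the samples, even when the test itself is significant in a large dataset.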

## H3 Descriptives: Contribution Quality

Contribution quality is not normally distributed, even when logged. We therefore use the Mann-Whitney U test to evaluate whether the observed differences are significant.
```{r H3descriptives}
##revision quality
hist(artDF$pct_revert) #proportion of reverted edits
hist(log(artDF$pct_revert))
ks.test(log(artDF$pct_revert), "pnorm")
qualTest.revert <- wilcox.test(pct_revert ~ source, data=artDF, exact=FALSE)
qualTest.revert

hist(artDF.dmg$pct_dmg) #proportion of damaging edits
hist(log(artDF.dmg$pct_dmg))
ks.test(log(artDF.dmg$pct_dmg), "pnorm")
qualTest.dmg <- wilcox.test(pct_dmg ~ source, data=artDF.dmg, exact=FALSE)
qualTest.dmg

```

These results tentatively suggest a log-log linear model. However, the two samples do not both have support across the full range of the data; in particular, the random sample contains a large number of zeroes.

```{r H3support}

g <- ggplot(artDF, aes(group=source, x=revid, y=pct_revert, color=source)) +
  geom_point(alpha=.2) +
  geom_smooth() +
  geom_rug(alpha=.2) +
  scale_color_discrete(name="Source", labels=c("Comparison", "Taboo")) +
  scale_x_continuous(trans='log') + # revisions on a log scale
  scale_y_continuous() +            # proportion reverted stays linear
  theme_bw() +
  theme(legend.position = 'bottom', legend.title = element_blank()) +
  labs(x='Number of Revisions', y='Proportion Reverted')

g

```
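
Because `log(0)` is `-Inf`, zero-valued proportions drop out of any logged model or log-scaled axis, which is one way two samples can end up with different support. The following sketch uses simulated proportions (not `artDF`) to show how one might count the affected rows per sample before choosing a transformation. `log1p()` appears only as one common alternative, not as the approach taken in the paper.

```r
set.seed(7)
# Simulated stand-in: the comparison sample has many exact zeroes
toy <- data.frame(
  source     = rep(c("comparison", "taboo"), each = 200),
  pct_revert = c(rbinom(200, 1, 0.4) * runif(200), runif(200, 0.01, 1))
)

# Share of rows per sample that a log transform would turn into -Inf
zero_share <- tapply(toy$pct_revert == 0, toy$source, mean)
zero_share

# log() loses the zeroes; log1p(x) = log(1 + x) keeps them finite
n_lost_log   <- sum(!is.finite(log(toy$pct_revert)))
n_lost_log1p <- sum(!is.finite(log1p(toy$pct_revert)))
c(log = n_lost_log, log1p = n_lost_log1p)
```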


## H4 Descriptives: Article Quality

Article quality is not normally distributed. We therefore use the Mann-Whitney U test to evaluate whether the observed differences are significant.
```{r H4descriptivesp2}
## relative quality

hist(qDF$weighted_sum)
ks.test(qDF$weighted_sum, "pnorm")
qualityTest <- wilcox.test(weighted_sum ~ source, data=qDF, exact=FALSE)
qualityTest
```



## H5 Descriptives: Contributor Identifiability

H5 examines five aspects of contributor identifiability:

- contributing without an account
- contributing with a new account (or being a relative newcomer)
- having a short user page (account holders only)
- specifying a gender (account holders only)
- being emailable (account holders only)

The paper reports on these measures using a user-level dataset. A revision-level dataset, although it suffers from repeated-measures issues, offers additional insight.

### Contributing without an account

We also cross-tabulated these contributions, examining the proportion that were reverted by whether the contributor used an account and by the sample to which the edited article belongs:

```{r}

revDF.xtab <- xtabs(~source+anon+got_reverted, data=revDF.clean)
prop.table(ftable(revDF.xtab), margin=1)

```
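
To make the layout of that table concrete: `ftable()` places `source` and `anon` in the rows and `got_reverted` in the columns, so `margin=1` normalizes each source-by-account row to sum to 1. A small sketch on made-up revisions follows; the column names mirror `revDF.clean`, but the values are illustrative and are not results.

```r
# Made-up revision-level records; values are illustrative only
toy <- data.frame(
  source       = c("comparison", "comparison", "comparison", "taboo",
                   "taboo", "taboo", "taboo", "comparison"),
  anon         = c("false", "true", "true", "false",
                   "true", "true", "false", "false"),
  got_reverted = c(FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE)
)

toy_xtab <- xtabs(~ source + anon + got_reverted, data = toy)

# Rows: source x anon combinations; columns: got_reverted.
# margin = 1 divides each cell by its row total.
row_props <- prop.table(ftable(toy_xtab), margin = 1)
row_props
```

Reading across a row then gives the reverted and non-reverted shares for one combination of sample and account status.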



### Contributing with a new account


```{r H5descriptives}

g <- ggplot(revDF.clean, aes(x=log(editor_nth_edit_nocollapse))) +
  geom_histogram() +
  facet_grid(source~., scales='free_y')

g

g <- ggplot(revDF.clean, aes(x=editor_nth_edit_nocollapse)) +
  geom_boxplot() +
  facet_grid(source~., scales='free_y')

g

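# mean editor_nth_edit_nocollapse for revisions made without an account
anonRevs <- subset(revDF.clean, as.logical(anon))
mean(anonRevs$editor_nth_edit_nocollapse)

# mean editor_nth_edit_nocollapse for revisions made with an account
accountRevs <- subset(revDF.clean, !as.logical(anon))
mean(accountRevs$editor_nth_edit_nocollapse)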

g <- ggplot(revDF.clean, aes(x=log(editor_nth_edit_nocollapse))) +
  geom_boxplot() +
  facet_grid(source~., scales='free_y')

g

```



### Having a user page

This is a revision-level view of user page length (including the absence of a user page), with repeated measures for contributors who made more than one revision.

```{r}

g <- ggplot(revDF.clean, aes(x=log1p(userpage_text_chars), group=source)) +
  geom_histogram(binwidth = .5) +
  facet_grid(source~., scales='free_y')

g

g <- ggplot(subset(revDF.clean, revDF.clean$userpage_text_chars < exp(4)), aes(x=log1p(userpage_text_chars), group=source)) +
  geom_histogram(binwidth = .5) +
  facet_grid(source~., scales='free_y')

g



```



### Gender

We used the API to retrieve user information: whether the user is emailable, whether a gender was specified, and, if so, whether the specified gender was female.

```{r gender-user-data}

table(revDF.clean$gender)
userDF <- data.frame('editor'=revDF.clean$editor, 'gender' = revDF.clean$gender, 'emailable' = revDF.clean$emailable, 'anon'=revDF.clean$anon)

userDF <- subset(userDF, userDF$anon == 'false')
userDF <- unique(userDF)
table(userDF$gender)

prop.table(table(revDF.clean$source, revDF.clean$gender), margin = 1)
gaveGenderDF <- subset(revDF.clean, revDF.clean$gender != 'unknown')
prop.table(table(gaveGenderDF$source, gaveGenderDF$gender), margin = 1)

```


### Being Emailable

User-level emailability is reported in the paper; however, the frequencies in the revision-level dataset show some variation from that trend.

```{r emailable-user-data}

table(revDF.clean$emailable)

table(userDF$emailable)

prop.table(table(revDF.clean$source, revDF.clean$emailable), margin = 1)

```
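
The two levels can diverge because prolific contributors appear once per revision. A sketch with hypothetical editors (not drawn from `revDF.clean`) shows how a single active account can pull the revision-level proportion away from the user-level one:

```r
# Hypothetical data: editor "a" made 8 revisions, "b" and "c" one each
toy_rev <- data.frame(
  editor    = c(rep("a", 8), "b", "c"),
  emailable = c(rep("true", 8), "false", "false")
)

# Revision level: every revision counts, so editor "a" dominates
rev_level <- prop.table(table(toy_rev$emailable))

# User level: one row per editor, as with unique(userDF) above
toy_user <- unique(toy_rev)
user_level <- prop.table(table(toy_user$emailable))

rbind(revision = rev_level, user = user_level)
```

In this toy case most revisions come from an emailable account even though most editors are not emailable, which is the kind of repeated-measures distortion the user-level dataset avoids.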

### Page Protection

```{r protection}

g <- ggplot(artDF.prot, aes(x=pct.prot, group=source)) +
  geom_boxplot() + 
  labs(x='Protection Proportion')

g

t.artDF <- subset(artDF.prot, artDF.prot$pct.prot > 0)

g <- ggplot(t.artDF, aes(x=pct.prot, group=source)) +
  geom_boxplot() + 
  labs(x='Protection Proportion (non-zero only)')

g


```
